An Empirical Investigation of Global and Local Normalization for Recurrent Neural Sequence Models Using a Continuous Relaxation to Beam Search
Globally normalized neural sequence models are considered superior to their
locally normalized equivalents because they may ameliorate the effects of label
bias. However, when considering high-capacity neural parametrizations that
condition on the whole input sequence, both model classes are theoretically
equivalent in terms of the distributions they are capable of representing.
Thus, the practical advantage of global normalization in the context of modern
neural methods remains unclear. In this paper, we attempt to shed light on this
problem through an empirical study. We extend an approach for search-aware
training via a continuous relaxation of beam search (Goyal et al., 2017b) in
order to enable training of globally normalized recurrent sequence models
through simple backpropagation. We then use this technique to conduct an
empirical study of the interaction between global normalization, high-capacity
encoders, and search-aware optimization. We observe that in the context of
inexact search, globally normalized neural models are still more effective than
their locally normalized counterparts. Further, since our training approach is
sensitive to warm-starting with pre-trained models, we also propose a novel
initialization strategy based on self-normalization for pre-training globally
normalized models. We perform analysis of our approach on two tasks: CCG
supertagging and Machine Translation, and demonstrate the importance of global
normalization under different conditions while using search-aware training.
Comment: Long paper at NAACL 2018
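
The contrast drawn above can be made concrete with a small sketch. The following Python snippet (NumPy only, toy scores) illustrates the scoring difference: a locally normalized model applies a softmax at every step, while a globally normalized model uses a single partition function over whole sequences, here approximated with a small candidate set in the spirit of beam-based training. This is an illustration of the two normalization schemes, not the paper's implementation.

import numpy as np

def local_log_prob(step_scores):
    """Locally normalized: per-step softmax, then sum the chosen log-probs."""
    # step_scores: (T, V) unnormalized scores over the vocabulary at each step;
    # for illustration we score the sequence that picks index 0 at every step.
    log_probs = step_scores - np.logaddexp.reduce(step_scores, axis=1, keepdims=True)
    return log_probs[:, 0].sum()

def global_log_prob(candidate_scores, target_idx):
    """Globally normalized: one partition function over whole sequences,
    approximated here by the scores of a small candidate set (a "beam")."""
    # candidate_scores: (K,) total unnormalized scores of K candidate sequences.
    return candidate_scores[target_idx] - np.logaddexp.reduce(candidate_scores)

rng = np.random.default_rng(0)
print(local_log_prob(rng.normal(size=(5, 8))))             # 5 steps, toy vocab of 8
print(global_log_prob(rng.normal(size=4), target_idx=0))   # 4 beam candidates
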
Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints
We present a discriminative model for single-document summarization that
integrally combines compression and anaphoricity constraints. Our model selects
textual units to include in the summary based on a rich set of sparse features
whose weights are learned on a large corpus. We allow for the deletion of
content within a sentence when that deletion is licensed by compression rules;
in our framework, these are implemented as dependencies between subsentential
units of text. Anaphoricity constraints then improve cross-sentence coherence
by guaranteeing that, for each pronoun included in the summary, the pronoun's
antecedent is included as well or the pronoun is rewritten as a full mention.
When trained end-to-end, our final system outperforms prior work on both ROUGE
as well as on human judgments of linguistic quality.
Comment: ACL 2016
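
As a rough illustration of the two constraint types described above, the sketch below checks whether a chosen set of sub-sentential units is admissible: a dependent unit may only be kept if its governing unit is kept, and a pronoun's antecedent unit must be kept unless the pronoun is rewritten as a full mention. The Unit fields and names are hypothetical simplifications, not the paper's actual formulation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

@dataclass
class Unit:
    uid: int
    parent: Optional[int] = None                    # compression dependency, if any
    pronoun_antecedents: List[int] = field(default_factory=list)

def is_valid_summary(selected: Set[int], units: Dict[int, Unit],
                     rewritten_pronoun_units: Set[int]) -> bool:
    for uid in selected:
        u = units[uid]
        # Compression: keep a dependent unit only if its governing unit is kept.
        if u.parent is not None and u.parent not in selected:
            return False
        # Anaphoricity: every pronoun's antecedent unit must also be kept,
        # unless this unit's pronoun is rewritten as a full mention.
        if uid not in rewritten_pronoun_units:
            if any(a not in selected for a in u.pronoun_antecedents):
                return False
    return True

units = {1: Unit(1), 2: Unit(2, parent=1), 3: Unit(3, pronoun_antecedents=[1])}
print(is_valid_summary({1, 2, 3}, units, rewritten_pronoun_units=set()))   # True
print(is_valid_summary({2, 3}, units, rewritten_pronoun_units=set()))      # False
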
Differentiable Scheduled Sampling for Credit Assignment
We demonstrate that a continuous relaxation of the argmax operation can be
used to create a differentiable approximation to greedy decoding for
sequence-to-sequence (seq2seq) models. By incorporating this approximation into
the scheduled sampling training procedure (Bengio et al., 2015)--a well-known
technique for correcting exposure bias--we introduce a new training objective
that is continuous and differentiable everywhere and that can provide
informative gradients near points where previous decoding decisions change
their value. In addition, by using a related approximation, we demonstrate a
similar approach to sampled-based training. Finally, we show that our approach
outperforms cross-entropy training and scheduled sampling procedures in two
sequence prediction tasks: named entity recognition and machine translation.
Comment: Accepted at ACL2017 (http://bit.ly/2oj1muX)
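
The core trick, a continuous relaxation of argmax, fits in a few lines: instead of feeding the embedding of the hard argmax token into the next decoder step, feed the expected embedding under a peaked softmax of the logits. The temperature alpha and the toy shapes below are illustrative assumptions, not the paper's exact configuration.

import numpy as np

def soft_argmax_embedding(logits, embeddings, alpha=10.0):
    """Differentiable surrogate for embeddings[argmax(logits)]."""
    # Peaked softmax: as alpha grows this approaches the hard argmax.
    z = alpha * logits
    weights = np.exp(z - z.max())
    weights /= weights.sum()
    return weights @ embeddings              # convex combination of embedding rows

rng = np.random.default_rng(0)
logits = rng.normal(size=6)                  # scores over a toy vocabulary of 6
embeddings = rng.normal(size=(6, 4))         # 6 tokens, embedding dimension 4
soft = soft_argmax_embedding(logits, embeddings)
hard = embeddings[np.argmax(logits)]
print(np.abs(soft - hard).max())             # small when the softmax is peaked enough
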
Visual Referring Expression Recognition: What Do Systems Actually Learn?
We present an empirical analysis of the state-of-the-art systems for
referring expression recognition -- the task of identifying the object in an
image referred to by a natural language expression -- with the goal of gaining
insight into how these systems reason about language and vision. Surprisingly,
we find strong evidence that even sophisticated and linguistically-motivated
models for this task may ignore the linguistic structure, instead relying on
shallow correlations introduced by unintended biases in the data selection and
annotation process. For example, we show that a system trained and tested on
the input image, without the input referring expression, can achieve a
precision of 71.2% in top-2 predictions. Furthermore, a system that predicts
only the object category given the input expression can achieve a precision of 84.2% in
top-2 predictions. These surprisingly positive results for what should be
deficient prediction scenarios suggest that careful analysis of what our models
are learning -- and further, how our data is constructed -- is critical as we
seek to make substantive progress on grounded language tasks.
Comment: NAACL 2018 short paper
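
For readers unfamiliar with the metric quoted above, here is a tiny sketch of precision over top-2 predictions, under the assumption that an example counts as correct when the referred-to object appears among the model's two highest-scoring candidates.

import numpy as np

def top2_precision(scores, gold):
    """scores: (N, C) candidate scores per example; gold: (N,) gold indices."""
    top2 = np.argsort(-scores, axis=1)[:, :2]
    return float(np.mean([g in row for g, row in zip(gold, top2)]))

rng = np.random.default_rng(0)
print(top2_precision(rng.normal(size=(100, 5)), rng.integers(0, 5, size=100)))
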
A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models
Beam search is a desirable choice of test-time decoding algorithm for neural
sequence models because it potentially avoids search errors made by simpler
greedy methods. However, typical cross entropy training procedures for these
models do not directly consider the behaviour of the final decoding method. As
a result, for cross-entropy trained models, beam decoding can sometimes yield
reduced test performance when compared with greedy decoding. In order to train
models that can more effectively make use of beam search, we propose a new
training procedure that focuses on the final loss metric (e.g. Hamming loss)
evaluated on the output of beam search. While well-defined, this "direct loss"
objective is itself discontinuous and thus difficult to optimize. Hence, in our
approach, we form a sub-differentiable surrogate objective by introducing a
novel continuous approximation of the beam search decoding procedure. In
experiments, we show that optimizing this new training objective yields
substantially better results on two sequence tasks (Named Entity Recognition
and CCG Supertagging) when compared with both cross entropy trained greedy
decoding and cross entropy trained beam decoding baselines.
Comment: Updated for clarity and notational consistency
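
A highly simplified sketch of the idea (not the paper's exact relaxation): replace the hard top-k selection over successor scores in one beam step with k iterated peaked softmaxes, so that the selections become soft, differentiable weight vectors rather than discrete indices. The temperature and the masking-by-subtraction trick below are assumptions made purely for illustration.

import numpy as np

def soft_top_k(scores, k, alpha=10.0, big=1e3):
    """Soft surrogate for picking the k best entries of a flat candidate score vector."""
    selections, mask = [], np.zeros_like(scores)
    for _ in range(k):
        z = alpha * (scores - big * mask)        # suppress already-selected mass
        w = np.exp(z - z.max())
        w /= w.sum()
        selections.append(w)
        mask = np.minimum(mask + w, 1.0)
    return np.stack(selections)                  # (k, num_candidates)

rng = np.random.default_rng(0)
successor_scores = rng.normal(size=12)           # e.g. beam size 3 x toy vocab 4, flattened
soft_sel = soft_top_k(successor_scores, k=3)
print(soft_sel.argmax(axis=1))                   # indices of the three best candidates
print(np.argsort(-successor_scores)[:3])         # hard top-3 for comparison
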
Improved Variational Autoencoders for Text Modeling using Dilated Convolutions
Recent work on generative modeling of text has found that variational
auto-encoders (VAE) incorporating LSTM decoders perform worse than simpler LSTM
language models (Bowman et al., 2015). This negative result is so far poorly
understood, but has been attributed to the propensity of LSTM decoders to
ignore conditioning information from the encoder. In this paper, we experiment
with a new type of decoder for VAE: a dilated CNN. By changing the decoder's
dilation architecture, we control the effective context from previously
generated words. In experiments, we find that there is a trade off between the
contextual capacity of the decoder and the amount of encoding information used.
We show that with the right decoder, VAE can outperform LSTM language models.
We demonstrate perplexity gains on two datasets, representing the first
positive experimental result on the use of VAE for generative modeling of text.
Further, we conduct an in-depth investigation of the use of VAE (with our new
decoding architecture) for semi-supervised and unsupervised labeling tasks,
demonstrating gains over several strong baselines.
Comment: camera ready
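
A minimal PyTorch sketch of the kind of decoder described above: a stack of causal, dilated 1-D convolutions over the embedded prefix, where the dilation schedule controls how much previously generated context the decoder sees. Layer sizes, the dilation schedule, and the way the latent code is injected are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv(nn.Module):
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation      # left-pad only, so the conv is causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, time)
        return torch.relu(self.conv(F.pad(x, (self.pad, 0))))

class DilatedCNNDecoder(nn.Module):
    def __init__(self, vocab_size, channels=128, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, channels)
        self.layers = nn.ModuleList(
            [CausalDilatedConv(channels, kernel_size=3, dilation=d) for d in dilations])
        self.out = nn.Linear(channels, vocab_size)

    def forward(self, tokens, z):                    # z: latent code from the encoder
        x = self.embed(tokens) + z.unsqueeze(1)      # condition every position on z
        x = x.transpose(1, 2)                        # -> (batch, channels, time)
        for layer in self.layers:
            x = layer(x)
        return self.out(x.transpose(1, 2))           # per-position vocabulary logits

decoder = DilatedCNNDecoder(vocab_size=1000)
logits = decoder(torch.randint(0, 1000, (2, 20)), torch.zeros(2, 128))
print(logits.shape)                                  # torch.Size([2, 20, 1000])
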
A Probabilistic Formulation of Unsupervised Text Style Transfer
We present a deep generative model for unsupervised text style transfer that
unifies previously proposed non-generative techniques. Our probabilistic
approach models non-parallel data from two domains as a partially observed
parallel corpus. By hypothesizing a parallel latent sequence that generates
each observed sequence, our model learns to transform sequences from one domain
to another in a completely unsupervised fashion. In contrast with traditional
generative sequence models (e.g. the HMM), our model makes few assumptions
about the data it generates: it uses a recurrent language model as a prior and
an encoder-decoder as a transduction distribution. While computation of
marginal data likelihood is intractable in this model class, we show that
amortized variational inference admits a practical surrogate. Further, by
drawing connections between our variational objective and other recent
unsupervised style transfer and machine translation techniques, we show how our
probabilistic view can unify some known non-generative objectives such as
backtranslation and adversarial loss. Finally, we demonstrate the effectiveness
of our method on a wide range of unsupervised style transfer tasks, including
sentiment transfer, formality transfer, word decipherment, author imitation,
and related language translation. Across all style transfer tasks, our approach
yields substantial gains over state-of-the-art non-generative baselines,
including the state-of-the-art unsupervised machine translation techniques that
our approach generalizes. Further, we conduct experiments on a standard
unsupervised machine translation task and find that our unified approach
matches the current state-of-the-art.
Comment: ICLR 2020 conference paper (spotlight). The first two authors
contributed equally.
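
The amortized variational surrogate mentioned above can be written schematically as follows; the placeholder functions stand in for the language-model prior, the encoder-decoder transduction distribution, and the inference network, and are toy stand-ins rather than the paper's code.

def elbo(x, sample_q, log_q, log_prior_lm, log_transduce, num_samples=4):
    """Monte Carlo estimate of E_q(y|x)[log p(x|y) + log p_LM(y) - log q(y|x)]."""
    total = 0.0
    for _ in range(num_samples):
        y = sample_q(x)                    # inference network proposes a latent parallel y
        total += log_transduce(x, y) + log_prior_lm(y) - log_q(y, x)
    return total / num_samples

# Toy stand-ins so the sketch runs; the real components are neural models.
sample_q = lambda x: x[::-1]
log_q = lambda y, x: -1.0
log_prior_lm = lambda y: -2.0 * len(y)
log_transduce = lambda x, y: -0.5 * len(x)
print(elbo("hello", sample_q, log_q, log_prior_lm, log_transduce))
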
An Empirical Investigation of Contextualized Number Prediction
We conduct a large scale empirical investigation of contextualized number
prediction in running text. Specifically, we consider two tasks: (1) masked
number prediction, predicting a missing numerical value within a sentence, and
(2) numerical anomaly detection, detecting an erroneous numeric value within a
sentence. We experiment with novel combinations of contextual encoders and
output distributions over the real number line. Specifically, we introduce a
suite of output distribution parameterizations that incorporate latent
variables to add expressivity and better fit the natural distribution of
numeric values in running text, and combine them with both recurrent and
transformer-based encoder architectures. We evaluate these models on two
numeric datasets in the financial and scientific domain. Our findings show that
output distributions that incorporate discrete latent variables and allow for
multiple modes outperform simple flow-based counterparts on all datasets,
yielding more accurate numerical prediction and anomaly detection. We also show
that our models effectively utilize textual context and benefit from
general-purpose unsupervised pretraining.
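
One family of output distributions of the kind described above can be sketched as a mixture over the real line whose component index is a discrete latent variable, which lets the density be multi-modal. The Gaussian components and fixed parameters below are illustrative assumptions; in the full model they would be predicted from the sentence context by the encoder.

import numpy as np

def mixture_log_likelihood(value, logits, means, log_sigmas):
    """log p(value) under a Gaussian mixture; logits select the latent component."""
    log_weights = logits - np.logaddexp.reduce(logits)           # log mixture weights
    sigmas = np.exp(log_sigmas)
    comp_ll = (-0.5 * ((value - means) / sigmas) ** 2
               - log_sigmas - 0.5 * np.log(2 * np.pi))           # per-component log-density
    return np.logaddexp.reduce(log_weights + comp_ll)            # marginalize the component

# Two components: one near 0, one near 3; evaluate the density at 3.2.
print(mixture_log_likelihood(3.2, logits=np.array([0.0, 1.0]),
                             means=np.array([0.0, 3.0]),
                             log_sigmas=np.array([0.0, -0.5])))
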
Learning to Describe Differences Between Pairs of Similar Images
In this paper, we introduce the task of automatically generating text to
describe the differences between two similar images. We collect a new dataset
by crowd-sourcing difference descriptions for pairs of image frames extracted
from video-surveillance footage. Annotators were asked to succinctly describe
all the differences in a short paragraph. As a result, our novel dataset
provides an opportunity to explore models that align language and vision, and
capture visual salience. The dataset may also be a useful benchmark for
coherent multi-sentence generation. We perform a first-pass visual analysis that
exposes clusters of differing pixels as a proxy for object-level differences.
We propose a model that captures visual salience by using a latent variable to
align clusters of differing pixels with output sentences. We find that, for
both single-sentence and multi-sentence generation, the proposed model
outperforms models that use attention alone.
Comment: EMNLP 2018
Narrative Text Generation with a Latent Discrete Plan
Past work on story generation has demonstrated the usefulness of conditioning
on a generation plan to generate coherent stories. However, these approaches
have used heuristics or off-the-shelf models to first tag training stories with
the desired type of plan, and then train generation models in a supervised
fashion. In this paper, we propose a deep latent variable model that first
samples a sequence of anchor words, one per sentence in the story, as part of
its generative process. During training, our model treats the sequence of
anchor words as a latent variable and attempts to induce anchoring sequences
that help guide generation in an unsupervised fashion. We conduct experiments
with several types of sentence decoder distributions: left-to-right and
non-monotonic, with different degrees of restriction. Further, since we use
amortized variational inference to train our model, we introduce two
corresponding types of inference network for predicting the posterior on anchor
words. We conduct human evaluations which demonstrate that the stories produced
by our model are rated better in comparison with baselines which do not
consider story plans, and are similar or better in quality relative to
baselines which use external supervision for plans. Additionally, the proposed
model gets favorable scores when evaluated on perplexity, diversity, and
control of the story via the discrete plan.
Comment: Findings of EMNLP 2020
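
The anchor-word idea can be sketched, in a much simplified form, as marginalization over a small candidate anchor set; the actual model treats the anchor as a latent variable over the full vocabulary and trains with amortized variational inference, so the candidate set and scores below are illustrative assumptions only.

import numpy as np

def sentence_log_marginal(anchor_log_prior, sentence_log_lik_given_anchor):
    """log sum_a p(a) p(sentence | a) over a small candidate anchor set."""
    return np.logaddexp.reduce(anchor_log_prior + sentence_log_lik_given_anchor)

# Toy numbers for three candidate anchors, e.g. ["storm", "ship", "crew"].
anchor_log_prior = np.log(np.array([0.5, 0.3, 0.2]))
sentence_log_lik = np.array([-12.0, -9.5, -14.0])      # hypothetical decoder scores
print(sentence_log_marginal(anchor_log_prior, sentence_log_lik))

# The posterior over anchors (what an inference network would approximate) is
# proportional to prior times likelihood:
post = np.exp(anchor_log_prior + sentence_log_lik)
print(post / post.sum())
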